INTRODUCTION


This report is presented in four sections. Section 1 presents an analysis of
the effects of noise on associative memories. Special emphasis is given to
generalized inverse memories.

Section 2 compiles several algorithms which are appropriate for
implementation on the NRL spatial light modulator. These include nonlinear
associative memory models and a pseudo inverse memory model that is optimum
for incomplete input patterns.

Section 3 analyzes a memory model that is optimum in the least squares sense
for input patterns with missing components. This memory is shown to be
derivable from a different optimization principle. This optimization produces
the memory which has a minimum average noise output for a given error in
recall.

In Section 4 of this report we discuss projection techniques and subspace
methods. These approaches will allow the design of dynamic memory and recall
schemes that are nonlinear, have local connectivity, and can be robust against
distorted input patterns.

1. ANALYSIS OF THE EFFECTS OF NOISE ON GENERALIZED INVERSE AND CORRELATION
MATRIX ASSOCIATIVE MEMORIES


Associative memory matrices are constructed from pairs of vectors. Each
pair consists of an input vector and an output vector. If the input and
output vectors are the same the memory is autoassociative. If they are
different the memory is heteroassociative. The input vectors have $l$
components and the output vectors have $p$ components. The number of pairs of
vectors, $n$, is typically less than the number of components. A correlation
matrix memory is constructed as

$$ M = Y X^{T} = \sum_i Y_i X_i^{T} $$

the sum of the outer products of the $n$ pairs. A generalized inverse
associative memory is constructed as

$$ M = Y X^{I} = \sum_i Y_i X_i^{I} $$

where $X^{I}$ is a generalized inverse of $X$ and $X_i^{I}$ is its $i$th row.
The memories have dimension $p \times l$.

There are several properties of true inverses which carry over to
generalized inverses. Four have been used to define useful classes of
generalized inverses.1

The four properties are:

$$ A A^{I} A = A \qquad (1) $$

$$ A^{I} A A^{I} = A^{I} \qquad (2) $$

$$ (A A^{I})^{T} = A A^{I} \qquad (3) $$

$$ (A^{I} A)^{T} = A^{I} A \qquad (4) $$

We use the notation $A^{i,j,k,l}$ to indicate which of the four properties are
satisfied (for example, $A^{1}$ satisfies (1), and $A^{1,4}$ satisfies (1) and
(4)). The Moore Penrose pseudo inverse satisfies all four.

$$ A^{+} = A^{1,2,3,4} $$

Associative memories using the Moore Penrose pseudo inverse have been
proposed and studied by Kohonen.2 Any inverse which satisfies (1) provides a
solution to $AZ = b$. $Z_0 = A^{1} b$ is a particular solution and
$Z = A^{1} b + (I - A^{1} A) y$ for arbitrary $y$ is a general solution.

Any inverse which satisfies (1) can be used to generate a memory matrix
with no crosstalk: $M$ operating on an exact copy of an input $X_i$ produces a
correct copy of the output $Y_i$. In correlation matrix memories the crosstalk
is a function of the overlap of inputs. Since generalized inverse associative
memories generate no crosstalk, the output for an input $X_i$ with additive
noise is

$$ M (X_i + N_i) = Y_i + N_i^{o} $$

and

$$ N_i^{o} = M N_i $$

where $N_i$ and $N_i^{o}$ are the input and output noise respectively. We
analyze here the role that the inputs and outputs have on the signal to noise
ratios.
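
The no-crosstalk property can be checked numerically. The following is a
minimal NumPy sketch and is not part of the original analysis; the dimensions,
the random stored pairs, and the choice of the Moore Penrose pseudo inverse as
the generalized inverse are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
l, p, n = 8, 3, 4                  # input size, output size, stored pairs (assumed)
X = rng.standard_normal((l, n))    # columns are the stored inputs X_i
Y = rng.standard_normal((p, n))    # columns are the stored outputs Y_i
X /= np.linalg.norm(X, axis=0)     # normalize inputs to unit length
Y /= np.linalg.norm(Y, axis=0)     # normalize outputs to unit length

M_corr = Y @ X.T                   # correlation matrix memory
M_ginv = Y @ np.linalg.pinv(X)     # generalized inverse memory (pseudo inverse)

crosstalk_corr = M_corr @ X - Y    # nonzero when the inputs overlap
crosstalk_ginv = M_ginv @ X - Y    # vanishes for linearly independent inputs

N = 0.1 * rng.standard_normal(l)           # additive input noise
N_out = M_ginv @ (X[:, 0] + N) - Y[:, 0]   # output noise for the first pair
```

For the generalized inverse memory the output noise is simply the memory
applied to the input noise, while the correlation memory carries an additional
crosstalk term.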

Without loss of generality we can assume that both input and output vectors
are normalized to unity. If inputs and outputs are all normalized to the same
length, then the ratio of the output noise to input noise is equal to the
ratio of input to output signal to noise for the generalized inverse
memories. For correlation matrix memories, crosstalk contributes to total
input noise. Since $M(X + N) = Y + N^{o} + C$, the more telling ratio is

$$ \frac{\|N^{o} + C\|}{\|N\|} $$

for correlation memories. The strength of the output noise versus the input
noise is given by

$$ \frac{\|N^{o}\|^{2}}{\|N\|^{2}} = \frac{\|M N\|^{2}}{\|N\|^{2}} = \frac{N^{T} M^{T} M N}{N^{T} N} $$

This ratio is bounded by the maximum and minimum eigenvalues of $M^{T}M$, that
is

$$ \mu_{\min} \|N\|^{2} \le \|N^{o}\|^{2} \le \mu_{\max} \|N\|^{2} $$


The output noise will be largest along the direction of the eigenvector of
$M^{T}M$ associated with $\mu_{\max}$ and smallest along the direction of the
eigenvector associated with $\mu_{\min}$. If the input noise is white the
average output noise to input noise can be obtained by averaging the $l$
eigenvalues of $M^{T}M$. This average is equal to the trace divided by the
number of components.

$$ \|N^{o}\|^{2} = \frac{1}{l} \operatorname{tr}(M^{T} M) \, \|N\|^{2} $$
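
As a numerical illustration of the eigenvalue bounds and of the white-noise
average, the sketch below builds a small generalized inverse memory and
compares one noise sample against the spectrum of $M^{T}M$. It is only a
sketch under assumed dimensions and random data; none of the variable names
come from the report.

```python
import numpy as np

rng = np.random.default_rng(1)
l, p, n = 10, 4, 3                       # assumed sizes, with n < l
X = rng.standard_normal((l, n))          # stored inputs as columns
Y = rng.standard_normal((p, n))          # stored outputs as columns
M = Y @ np.linalg.pinv(X)                # generalized inverse memory

mu = np.linalg.eigvalsh(M.T @ M)         # eigenvalues of M^T M, ascending
N = rng.standard_normal(l)               # one sample of input noise
gain = np.sum((M @ N) ** 2) / np.sum(N ** 2)   # output/input noise strength

avg_gain = np.trace(M.T @ M) / l         # white-noise average gain
rank = int(np.sum(mu > 1e-10 * mu.max()))      # number of nonzero eigenvalues
```

Because at most `n` eigenvalues are nonzero, `avg_gain` equals `n / l` times
the mean of the `n` largest eigenvalues.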


If there are many zero eigenvalues, this average will be small. Indeed,
typically the number of components in the input vectors, $l$, is large
compared with the number of pairs, $n$, and the rank of $M^{T}M$ will be less
than or equal to $n$, depending on the linear independence of the inputs. If
the inputs are linearly independent there will be $n$ nonzero eigenvalues and
the trace can be replaced by $n$ times the average of these $n$ largest
eigenvalues, $\hat{\mu}$

$$ \|N^{o}\|^{2} = \frac{n}{l} \, \hat{\mu} \, \|N\|^{2} $$


The $n/l$ reduction in noise has been derived before for pseudoinverse
autoassociative memories. The $n/l$ reduction is obtained by any correlation
matrix memory or generalized inverse memory. The memory need not be
autoassociative nor even square. The reduction depends on the number of
components in the input vectors and is independent of the number of output
components. Increasing the number of components in the input vectors for the
sake of noise reduction alone is not advised, however, as the noise that is
reduced is noise that is perpendicular to the space of the inputs and which
does not contribute to the confusion among inputs. The input noise is a
combination of perpendicular and parallel noise.

$$ N = N_{\parallel} + N_{\perp} $$

These contributions are the contributions in the subspace associated with the
inputs, $N_{\parallel}$, and contributions which are in the complement
subspace, $N_{\perp}$. The noise in the complement subspace is orthogonal to
the subspace associated with the inputs. For correlation matrix memories
$X_i^{T} N_{\perp} = 0$ for all stored inputs. For generalized inverse
associative memories $X^{I} N_{\perp} = 0$ for all stored inputs and therefore

$$ M N = M N_{\parallel} $$

$$ M N_{\perp} = 0 $$

$N_{\parallel}$ can be expressed as a linear combination of the stored inputs
since the inputs form a basis

$$ N_{\parallel} = \sum_i \nu_i X_i $$

Increasing the number of input components ($l$) on the surface appears to be
helpful in reducing output noise. However, unless the added components change
the eigenstructure of $M^{T}M$ and reduce the crosstalk, there is no reduction
in confusion. When the input noise is white it is partitioned equally in all
directions on the average with average total strength of $\sigma^{2}$. The
partition between the parallel and perpendicular subspaces depends only on
their relative dimensions

$$ \|N_{\parallel}\|^{2} = \frac{n}{l} \, \sigma^{2} $$

and

$$ \|N_{\perp}\|^{2} = \frac{l - n}{l} \, \sigma^{2} $$

It is only parallel input noise that mixes inputs and causes confusion. The
average output noise strength depends on the average nonzero eigenvalue of
$M^{T}M$ and the average parallel input noise strength

$$ \|N^{o}\|^{2} = \hat{\mu} \, \|N_{\parallel}\|^{2} $$

The output noise strength in general is given by

$$ \|N^{o}\|^{2} = \|M N_{\parallel}\|^{2} $$

and the covariance matrix for the output noise can be found from the
covariance matrix of the input noise. Let $R_{N}$ be the covariance matrix of
the input noise. Then $R_{N^{o}} = M R_{N} M^{T}$ is the covariance matrix of
the output noise. If the input noise is white, the output covariance is
proportional to the outer product of the memory with itself,
$R_{N^{o}} = \sigma^{2} M M^{T}$. The covariances and strengths of the output
noise are a function of the singular values of the memory matrix or
equivalently the eigenstructure of $M^{T}M$ and $MM^{T}$. The analysis that
follows will show that the eigenstructure depends on the metrics for the
inputs and outputs. The metrics are inner product matrices of dimension
$n \times n$. Let $A_X = X^{T}X$ and $A_Y = Y^{T}Y$. The eigenvalues and
eigenvectors of $M^{T}M$ can be found by solving:

$$ M^{T} M \phi = \mu \phi $$

If $M$ is a generalized inverse memory, $M^{T}M = (X^{I})^{T} A_Y X^{I}$.
Defining $\psi$ through $\phi = X \psi$ yields the following generalized
eigenvalue problem

$$ A_Y \psi = A_X \psi \mu $$

If $M$ is a correlation memory matrix the inner product is
$M^{T}M = X A_Y X^{T}$. Defining $\xi$ through $\phi = (X^{I})^{T} \xi$ the
eigenvalue problem is transformed to the generalized eigenvalue problem

$$ A_X A_Y \xi = \xi \mu $$

or $A_Y \xi = A_X^{-1} \xi \mu$ if the inverse exists.

We  consider  a  few  special  cases  then  solve  the  generalized  eigenvalue 
problem  for  the  general  case. 


Case 1 Orthogonal Inputs $A_X = I$

In this case, the generalized inverse and correlation matrix memories
have the same characteristics. The nonzero eigenvalues of $M^{T}M$, $\mu$, are
the eigenvalues of $A_Y$.

For high correlation in outputs, some eigenvalues can get large and there
will be large noise gain in those directions. The trace of $A_Y$ is $n$ so
there will be no increase in average output noise strength.

Case 2 Orthogonal Outputs $A_Y = I$

For high correlation in inputs the $\mu$ can get very large in some
directions. For the generalized inverse memory the eigenvalues are the
eigenvalues of $A_X^{-1}$. There will be large noise gains in the directions
associated with large $\mu$. The trace of $A_X^{-1}$ will be greater than $n$
and there will be a gain in the average noise strength for generalized inverse
memories. For the correlation matrix memory the nonzero eigenvalues $\mu$ of
$M^{T}M$ are the eigenvalues of $A_X$. The trace of $A_X$ is $n$, so no gain
in average output noise will occur for correlation matrix memories.


Case 3 Autoassociative Memory

Here the inputs are the same as the outputs, $X_i = Y_i$. The generalized
inverse memory $M$ is a projection operator. All of the nonzero eigenvalues of
$M^{T}M$ are one and its trace is $n$. For the correlation matrix memory the
eigenvalues of $M^{T}M$ are the eigenvalues of $(A_X)^{2}$. If there are
inputs which are highly correlated there will be large gains in output noise
strength in those directions. The trace of $(A_X)^{2}$ will be larger than $n$
and a net gain in average output noise strength will result.

Case 4 Heteroassociative But $A_X = A_Y$

This is an interesting special case in which $Y \neq X$ and $M$ need not be
square. The generalized inverse memory, $M$, is not a projection operator but
$M^{T}M$ is a projection operator with unit eigenvalues and a trace equal to
$n$.

For the correlation matrix memory, $M^{T}M$ has eigenvalues again which are
equal to those of $(A_X)^{2}$. The noise characteristics are identical to
those of the autoassociative memory, Case 3, for both the correlation and the
generalized inverse memories.

Nonspecial cases require solution of the generalized eigenvalue problem.
An analysis and a set of bounds on the generalized eigenvalues can be obtained
by performing generalized singular value decomposition. For a pair of
matrices $A$ and $B$ which have the same number of columns the following
decomposition is possible and is called the generalized singular value
decomposition of $A$ and $B$.

$$ A = V_A \, a \, Z^{T} $$

$$ B = V_B \, b \, Z^{T} $$

where $V_A$ and $V_B$ are unitary matrices, $a$ and $b$ are matrices whose
only nonzero elements lie along the diagonal, and $Z$ is a matrix with
linearly independent columns. Without loss of generality, these columns can be
assumed to be normalized to unity. For each column of $Z$, $Z_i$, there is a
pair of generalized singular values $a_i$ and $b_i$ where

$$ Z_i^{T} A^{T} A Z_i = a_i^{2} = Z_i^{T} A_a Z_i $$

and

$$ Z_i^{T} B^{T} B Z_i = b_i^{2} = Z_i^{T} A_b Z_i $$

since $V_A^{T} V_A$ and $V_B^{T} V_B$ are unity.

Thus

$$ Z_i^{T} A_a Z_i = \frac{a_i^{2}}{b_i^{2}} \, Z_i^{T} A_b Z_i , \qquad \mu_i = \frac{a_i^{2}}{b_i^{2}} $$

The ratios of the squares of the generalized singular values are the
generalized eigenvalues. The squared generalized singular values are bounded
by the minimum and maximum eigenvalues of $A^{T}A$, $\alpha_{\min}$ and
$\alpha_{\max}$, and the minimum and maximum eigenvalues of $B^{T}B$,
$\beta_{\min}$ and $\beta_{\max}$,

$$ \alpha_{\min} \le a_i^{2} \le \alpha_{\max} $$

and

$$ \beta_{\min} \le b_i^{2} \le \beta_{\max} $$

Thus the generalized eigenvalues $\mu_i = a_i^{2} / b_i^{2}$ are bounded by

$$ \frac{\alpha_{\min}}{\beta_{\max}} \le \mu_i \le \frac{\alpha_{\max}}{\beta_{\min}} $$


We can apply these results to the generalized inverse associative memory. A
generalized singular value decomposition of $X$ and $Y$ yields

$$ X = V_X \, b \, \psi^{T} $$

$$ Y = V_Y \, a \, \psi^{T} $$

Substituting into $A_Y \psi = A_X \psi \mu$ yields the generalized
eigenvalues, $\mu_i = a_i^{2} / b_i^{2}$, which are the eigenvalues of
$M^{T}M$. These are bounded by the ratio of the eigenvalues of $A_X$ and
$A_Y$, $\lambda_X$ and $\lambda_Y$ respectively.

$$ \frac{\lambda_Y(\min)}{\lambda_X(\max)} \le \mu_i \le \frac{\lambda_Y(\max)}{\lambda_X(\min)} $$

For the correlation matrix memory a generalized singular value
decomposition of $Y$ and $(X^{I})^{T}$ is required. Similar analysis yields
the following bounds in terms of the eigenvalues of $A_X$ and $A_Y$.

$$ \lambda_Y(\min) \, \lambda_X(\min) \le \mu \le \lambda_Y(\max) \, \lambda_X(\max) $$
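
These bounds can be checked on a small example. The sketch below is an
illustration only: it assumes random unit-length stored vectors, uses the
Moore Penrose pseudo inverse as the generalized inverse, and takes the number
of output components to be at least the number of pairs so that both metrics
are nonsingular.

```python
import numpy as np

rng = np.random.default_rng(2)
l, p, n = 9, 6, 4                                 # assumed sizes, p >= n
X = rng.standard_normal((l, n)); X /= np.linalg.norm(X, axis=0)
Y = rng.standard_normal((p, n)); Y /= np.linalg.norm(Y, axis=0)

A_X, A_Y = X.T @ X, Y.T @ Y                       # input and output metrics
lam_X = np.linalg.eigvalsh(A_X)                   # ascending eigenvalues
lam_Y = np.linalg.eigvalsh(A_Y)

def nonzero_eigs(M, n):
    """The n largest eigenvalues of M^T M (the rest are zero)."""
    return np.linalg.eigvalsh(M.T @ M)[-n:]

mu_ginv = nonzero_eigs(Y @ np.linalg.pinv(X), n)  # generalized inverse memory
mu_corr = nonzero_eigs(Y @ X.T, n)                # correlation matrix memory

bounds_ginv = (lam_Y[0] / lam_X[-1], lam_Y[-1] / lam_X[0])
bounds_corr = (lam_Y[0] * lam_X[0], lam_Y[-1] * lam_X[-1])
```

Both sets of nonzero eigenvalues fall inside the corresponding pair of bounds.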


As these bounds indicate, the output noise strength can be much greater or
much less than the input noise. In order to achieve the minimum or maximum
gain in noise in generalized inverse memories there must be high negative
correlation in the output when there is high positive correlation in the
inputs and vice versa. For correlation matrix memories, high correlation of
the same sign in inputs and corresponding outputs will cause large gain in
noise. A simple two dimensional example will illustrate the role of the metric
matrices in this analysis.

In Figure 1.1 are shown two input vectors $X_1$ and $X_2$ and, in another
plane, two output vectors $Y_1$ and $Y_2$.

The metrics $A_X$ and $A_Y$ are

$$ A_X = \begin{pmatrix} 1 & d \\ d & 1 \end{pmatrix} $$

and

$$ A_Y = \begin{pmatrix} 1 & e \\ e & 1 \end{pmatrix} $$

where $-1 < d < 1$ and $-1 < e < 1$.

Figure 1.1a. Two Input Vectors and Two Output Vectors that are Positively
Correlated, $d > 0$ and $e > 0$

Figure 1.1b. Two Input Vectors that are Positively Correlated and Two Output
Vectors that are Negatively Correlated, $d > 0$ and $e < 0$

The generalized eigenvalue problem for the case of the pseudo inverse
associative memory

$$ A_Y \psi_i = A_X \psi_i \mu_i $$

yields eigenvectors

$$ \psi_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix} $$

and

$$ \psi_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix} $$

with eigenvalues

$$ \mu_1 = \frac{1 + e}{1 + d} $$

and

$$ \mu_2 = \frac{1 - e}{1 - d} $$

If the inputs are positively correlated and the outputs negatively correlated
then $\mu_2$ can become quite large. On the other hand, if both inputs and
outputs are positively correlated and $e > d$ then $\mu_2$ is less than 1 and
a net reduction in noise will result in the $\psi_2$ direction. If $d = e$,
both eigenvalues are 1 (Case 4), and no net gain or reduction occurs.

For the correlation matrix memory

$$ A_Y \xi_i = A_X^{-1} \xi_i \mu_i $$

yields eigenvectors

$$ \xi_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix} $$

$$ \xi_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix} $$

with eigenvalues

$$ \mu_1 = (1 + e)(1 + d) $$

$$ \mu_2 = (1 - e)(1 - d) $$

If both inputs and outputs are positively correlated $\mu_1 > 1$, or if both
inputs and outputs are negatively correlated $\mu_2 > 1$, and noise gain will
occur in the $\psi_1$ or $\psi_2$ directions respectively.
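
The closed-form eigenvalues of the two dimensional example can be reproduced
directly from the two metric matrices. This is a small sketch for checking the
algebra; the function names and the sample values of $d$ and $e$ are chosen
here for illustration and do not appear in the report.

```python
import numpy as np

def metric(c):
    """2 x 2 metric for two unit vectors with inner product c."""
    return np.array([[1.0, c], [c, 1.0]])

def noise_gains(d, e):
    """Sorted eigenvalues for the pseudo inverse and correlation memories."""
    A_X, A_Y = metric(d), metric(e)
    # Pseudo inverse memory: A_Y psi = mu A_X psi
    mu_ginv = np.sort(np.linalg.eigvals(np.linalg.solve(A_X, A_Y)).real)
    # Correlation memory: A_X A_Y xi = mu xi
    mu_corr = np.sort(np.linalg.eigvals(A_X @ A_Y).real)
    return mu_ginv, mu_corr

# Inputs positively correlated, outputs negatively correlated.
mu_ginv, mu_corr = noise_gains(d=0.8, e=-0.5)
```

With these sample values the pseudo inverse memory shows a large gain along
the second eigenvector, $(1 - e)/(1 - d)$, while both correlation memory
eigenvalues stay below one.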

The example illustrates how both input and output metrics influence the
output noise to input noise ratios. The most dominating factor however is the
input metric. In application, the real issue is confusion: will the input
plus noise look like an incorrect input? If it does the output will be the
incorrect output. The key concern is the effects of noise in directions
causing confusion. These directions are those pointing along input
differences.

Noise that directly contributes to confusing inputs $X_i$ and $X_j$ lies
along $X_i - X_j$, $N_{ij} = \eta (X_i - X_j)$. The role that the output
metric plays is through the correlation between outputs.

For a generalized inverse memory the confusing output noise is

$$ N_{ij}^{o} = \eta (Y_i - Y_j) $$

and thus

$$ \frac{\|N_{ij}^{o}\|}{\|N_{ij}\|} = \frac{\|Y_i - Y_j\|}{\|X_i - X_j\|} $$

Noise in confusing directions increases proportionally to the relative
separation of outputs. The example in Figure 1.1 illustrates this point. When
the inputs are positively correlated and the outputs negatively correlated
there is a large noise increase in the $\psi_2$ direction, which lies along
$M(X_1 - X_2)$. For this case $(Y_1 - Y_2)$, the outputs being negatively
correlated, is correspondingly large. If both inputs and outputs are
positively correlated and the outputs more correlated than the inputs then a
net decrease in noise occurs in the $Y_1 - Y_2$ direction, but the distance
between $Y_1$ and $Y_2$ decreases correspondingly.

This analysis suggests the real concerns with noise are the magnitudes of
the input noise in confusing directions and the distance between two inputs.
If the magnitude of the noise in a confusing direction exceeds the magnitude
of the distance between the two input vectors, confusion will occur.
Otherwise it will not. The analysis for correlation matrix memories is
complicated by the crosstalk. For input noise pointing along a confusing
direction, $N_{ij}$, the output noise no longer lies exactly along the
confusing direction in the output space.

$$ \frac{\|N_{ij}^{o}\|}{\|N_{ij}\|} = \frac{\|(Y_i - Y_j) + (C_i - C_j)\|}{\|X_i - X_j\|} $$

The difference in crosstalk can be either positive or negative and the
component of this difference along $Y_1 - Y_2$ can also be either positive or
negative.

2. ADAPTIVE ASSOCIATIVE MEMORY MODELS WITH POTENTIAL
OPTICAL IMPLEMENTATION

A pseudo inverse matrix memory model has been successfully implemented
optically at NRL.3 The algorithm used there is adaptive. Recall is a single
matrix multiplication. Let the input vectors, $X$, have $l$ components and the
output vectors, $Y$, have $p$ components; then the $p \times l$ memory matrix,
$M$, recalls $Y$ by matrix multiplying $X$

$$ Y = M X $$

During the learning phase of operation $M$ can be found from a solution to
$X^{T} M^{T} = Y^{T}$. This is a standard linear estimation problem. However,
if the number of input output pairs is less than the number of components in
the input vectors, the system is underdetermined and $M$ is not uniquely
determined. A unique solution that gives the memory matrix $M$ minimum
Frobenius norm $\|M\|_F$ is obtained from the Moore Penrose pseudo inverse of
$X$, $X^{+}$.

This memory can be found through an iterative algorithm which uses one
row of $X^{T}$ and $Y^{T}$ at a time, i.e., it uses only one input output
pair. The algorithm was first introduced by Kaczmarz1 and later used first in
associative memories by Widrow.2

The key step in the algorithm is

$$ M^{T}(k+1) = M^{T}(k) + \lambda \, X_k \left( Y_k^{T} - X_k^{T} M^{T}(k) \right) $$

This algorithm is currently implemented at NRL.3 With minor modifications the
algorithm can be used to find the pseudo inverse of $X$ or to be a novelty
filter for $X$. By choosing $Y$ to be the $n \times n$ unit matrix the
algorithm will yield the pseudo inverse of $X$. By initializing $M(0)$ to the
$l \times l$ unit matrix and setting $Y$ to zero, $M$ becomes a novelty
filter. The method of Kaczmarz has spawned a large variety of algorithms for
large systems and their applications. These methods are known as row action
methods since in a single step only one row of the matrix is required. Some
basic algorithms that may have use in implementation are included in the
following section.
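
The three uses of the adaptive step (memory, pseudo inverse, novelty filter)
can be sketched in a few lines of NumPy. This is an illustrative software
sketch, not the NRL optical implementation; the function name `learn_memory`,
the division of the step by the squared input length, the number of sweeps,
and the random test data are all assumptions made here.

```python
import numpy as np

def learn_memory(X, Y, lam=1.0, sweeps=500, M0=None):
    """Row action learning: each update uses one input output pair."""
    l, n = X.shape
    p = Y.shape[0]
    # Work with M^T; start from zero unless an initial memory is supplied.
    Mt = np.zeros((l, p)) if M0 is None else np.array(M0, dtype=float).T
    for _ in range(sweeps):
        for k in range(n):
            x, y = X[:, k], Y[:, k]
            # M^T <- M^T + lam * x (y^T - x^T M^T) / ||x||^2
            Mt += lam * np.outer(x, y - x @ Mt) / (x @ x)
    return Mt.T

rng = np.random.default_rng(3)
l, p, n = 8, 2, 3
X = rng.standard_normal((l, n))      # stored inputs as columns
Y = rng.standard_normal((p, n))      # stored outputs as columns

M = learn_memory(X, Y)                                     # memory Y X^+
X_pinv = learn_memory(X, np.eye(n))                        # Y = I gives X^+
novelty = learn_memory(X, np.zeros((l, n)), M0=np.eye(l))  # M(0) = I, Y = 0
```

Starting from zero, the iteration settles on the minimum Frobenius norm
solution; starting from the unit matrix with zero targets, it removes the
component of any vector that lies in the span of the stored inputs.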

Row Action Methods

We outline here a few of the row action methods which may be applicable
to optical implementation of associative memory and recall. Since the
algorithms can be used for both purposes we describe them here in a generic
notation commonly used for systems of linear equations

$$ A x = b $$

This system may be underdetermined, or greatly overdetermined with the
possibility of inconsistencies (self contradiction), and it can be
ill-conditioned. There may be reason to doubt that the exact algebraic
solution is the desired one, especially if the data contain measurement
inaccuracies and/or noise, distortion, and missing data. Other approaches
include:

(i) Constrained Minimization

$$ \min \|Ax - b\| $$

subject to $x \in S$. By $x \in S$ we mean that $x$ may be required to satisfy
some set of constraints, for example box constraints.

(ii) Regularization

Minimize

$$ f(x) + \lambda \|Ax - b\|^{2} $$

where $f(x)$ is usually the square norm of $x$, $\|x\|^{2}$, and $x \in S$.

(iii) Feasibility

For each row vector of $A$

$$ (b_i - \gamma_i) \le a_i^{T} x \le (b_i + \epsilon_i), \qquad x \in S $$

The $\gamma_i$ and $\epsilon_i$ are picked so that the feasibility region is
not empty.

(iv) Optimization

Minimize $F(x)$ subject to

$$ (b_i - \gamma_i) \le a_i^{T} x \le (b_i + \epsilon_i), \qquad x \in S $$

The basic algorithm is the Kaczmarz algorithm. It solves $Ax = b$.

In this notation the $n$th step is

$$ x^{n+1} = x^{n} + \mu_k \, a_k \left( b_k - a_k^{T} x^{n} \right) $$

where $\mu_k = \lambda_k / \|a_k\|^{2}$ and $\lambda_k$, called the relaxation
parameter, is bounded by $0 < \lambda_k < 2$. In the field of image
reconstruction from projections this algorithm is known as ART.
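
A minimal software sketch of the Kaczmarz step is given below. The cyclic row
ordering, the fixed relaxation parameter, the number of sweeps, and the small
test system are assumptions chosen for illustration.

```python
import numpy as np

def kaczmarz(A, b, lam=1.0, sweeps=200, x0=None):
    """Cyclic Kaczmarz (ART) iteration for A x = b, one row per step."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    for _ in range(sweeps):
        for k in range(m):
            a = A[k]                               # k-th row of A
            mu = lam / (a @ a)                     # mu_k = lambda_k / ||a_k||^2
            x = x + mu * (b[k] - a @ x) * a        # move toward the k-th equation
    return x

# Underdetermined example: three equations, four unknowns.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = kaczmarz(A, b)
```

Started from zero on a consistent underdetermined system, the iterates stay in
the row space of `A`, so the limit is the minimum norm solution given by the
pseudo inverse.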


The most similar algorithm is the relaxation method of Agmon and of Motzkin
and Schoenberg10 for the linear feasibility problem. Collecting all
constraints and expressing them as

$$ A x \le b $$

the basic step is

$$ x^{n+1} = x^{n} + a_k \, g_k $$

where

$$ g_k = \mu_k \left( b_k - a_k^{T} x^{n} \right) $$

for the right hand side negative and

$$ g_k = 0 $$

otherwise.

The interval feasibility problem can be implemented by repeating the
rows, one positive to satisfy $a_i^{T} x \le b_i + \epsilon_i$ and one
negative to satisfy

$$ -a_i^{T} x \le -(b_i - \gamma_i) $$


A method for regularization has been developed by Herman et al. The
procedure minimizes $f(x) = \|x\|^{2} + \gamma^{2} \|Ax - b\|^{2}$; the $n$th
step is

$$ z^{n+1} = z^{n} + g_k \, e_k $$

$$ x^{n+1} = x^{n} + \gamma \, a_k \, g_k $$

where $e_k$ is the vector with elements zero except for the $k$th which is
unity, i.e., the $k$th column of the unit matrix. $z^{0}$ is arbitrary and
$x^{0} = \gamma A^{T} z^{0}$. $g_k$ is found from

$$ g_k = \mu_k \left[ \gamma \left( b_k - a_k^{T} x^{n} \right) - z_k^{n} \right] $$

where

$$ \mu_k = \frac{\lambda_k}{1 + \gamma^{2} \|a_k\|^{2}} $$

Each of these methods may be implemented optically. The feasibility method
however requires a thresholding step for $g_k$. The regularization method may
require a second spatial light modulator to store $z$.
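
The regularization step can be sketched in software as a row action iteration
on the pair $(z, x)$. The sketch below assumes the update is the Kaczmarz step
applied to the augmented system $z + \gamma A x = \gamma b$, which is one
reading of the procedure; the function name, the zero starting vector, and the
test data are illustrative choices.

```python
import numpy as np

def regularized_row_action(A, b, gamma=1.0, lam=1.0, sweeps=500):
    """Row action iteration minimizing ||x||^2 + gamma^2 ||A x - b||^2."""
    m, n = A.shape
    z = np.zeros(m)                  # auxiliary vector, one entry per row of A
    x = gamma * A.T @ z              # x^0 = gamma A^T z^0 (zero here)
    for _ in range(sweeps):
        for k in range(m):
            a = A[k]
            mu = lam / (1.0 + gamma ** 2 * (a @ a))
            g = mu * (gamma * (b[k] - a @ x) - z[k])
            z[k] += g                # z <- z + g e_k
            x = x + gamma * g * a    # x <- x + gamma a_k g
    return x, z

# Inconsistent overdetermined example.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
b = np.array([1.0, 2.0, 0.0, 1.0])
gamma = 1.0
x, z = regularized_row_action(A, b, gamma=gamma)

# Closed-form regularized solution for comparison.
x_ref = np.linalg.solve(np.eye(2) + gamma ** 2 * A.T @ A, gamma ** 2 * A.T @ b)
```

The auxiliary vector `z` is the extra storage that the text associates with a
second spatial light modulator.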

All of the above algorithms can be made nonlinear by requiring $x$ to
satisfy constraints $x \in S$. To implement the box constraints, for example,
the output at the end of each step, $x^{n}$, can be thresholded so that

$$ \hat{x}_i^{n} = x_i^{n} \quad \text{if } c_i \le x_i^{n} \le p_i $$

$$ \hat{x}_i^{n} = c_i \quad \text{if } x_i^{n} < c_i $$

$$ \hat{x}_i^{n} = p_i \quad \text{if } x_i^{n} > p_i $$

and $\hat{x}^{n}$ is used to compute $x^{n+1}$. Simple thresholding can be
used to solve these problems subject to the constraint conditions.

3. ON OPTIMAL ASSOCIATIVE RECALL FROM AN INCOMPLETE INPUT VECTOR


An issue of concern in associative memory design is its robustness
against noisy input data. Without a noise model the problem is quite
formidable. A problem that is somewhat less difficult, that of recall from
input patterns that have been masked so that some components are missing
(i.e., equal zero), has been attacked more frequently. An interesting paper
by Murakami et al. found an optimal associative memory for recall of a
pattern from an input pattern that had a known fixed set of components equal
to zero. The memory is optimal in the sense of best linear unbiased estimate
(i.e., least squares). If we let the input vectors have a total of $n$
components and let $n - s$ of the components be masked, the optimal
associative memory for inputs with only $s$ out of $n$ components present is

$$ {}^{s}M = (n - 1) \, Y X^{T} \left[ (s - 1) X X^{T} + (n - s) R \right]^{-1} $$

where $R$ is the diagonal of $X X^{T}$.

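This result can be simplified if the rows of $X$ are equilibrated so that
they have unit length. If we assume that this has been done then $R = I$ and

$$ {}^{s}M = (n - 1) \, Y X^{T} \left[ (s - 1) X X^{T} + (n - s) I \right]^{-1} = (n - 1) \, Y \left[ (s - 1) (X^{T} X) + (n - s) I \right]^{-1} X^{T} $$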


This expression is valid for $0 < s \le n$. For $s = 1$, only 1 component
unmasked, the memory reduces to a correlation matrix memory. For $s = n$ it
reduces to the Moore Penrose pseudo inverse memory. For $s \neq n$ the memory
resembles a regularized inverse. In fact this memory can be arrived at
from a constrained minimization problem. Minimize the average noise output
strength of the memory, keeping the total mean square error in memory
components equal to $R$. The problem takes the form

$$ \min \|M\|_F^{2} $$

subject to

$$ \|MX - Y\|_F = R $$

where $\|\cdot\|_F$ is the Frobenius norm.

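When used in recall the memory also solves a minimization problem. We can
factor the solution into two steps. We write ${}^{s}M$ as

$$ {}^{s}M = \frac{n - 1}{s - 1} \, Y \left( X^{T} X + \alpha I \right)^{-1} X^{T} , \qquad \alpha = \frac{n - s}{s - 1} $$

Step (1) is solve

$$ \left( X^{T} X + \alpha I \right)^{-1} X^{T} \tilde{X} = t $$

where $\tilde{X}$ is the incomplete input vector; then step (2) multiply $t$
by $Y$ to get the output $\tilde{Y}$.

Step (1) is equivalent to solving

$$ \tilde{X} = X t $$

using regularization techniques, that is, it is equivalent to

$$ \min \|t\| $$

subject to $\|X t - \tilde{X}\| = R$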

By keeping the elements of $t$ small we keep the memory outputs smooth; we
reduce the likelihood of large oscillatory mixing of output components.
$\alpha$ is the reciprocal of the Lagrange multiplier which makes the solution
satisfy the constraints. In applications, one typically solves step 1 for a
set of $\alpha$'s and then settles on a residual and norm which seem
reasonable. The optimum associative memory selects $\alpha$ to be large
compared to typical regularization values and hence accepts a rather large
residual for minimum noise output. Given that $n - s$ components of the input
are known to be wrong this is quite reasonable. An algorithm which would allow
optical implementation of this memory is given in Section 2.
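
The limiting cases and the two step recall can be checked with a short NumPy
sketch. It assumes the rows of $X$ have been equilibrated so that $R = I$,
writes the memory in the form that inverts the small pattern-by-pattern
matrix, and takes the regularization constant to be $(n - s)/(s - 1)$ as
implied by that form; the function name and the random test data are
illustrative.

```python
import numpy as np

def masked_input_memory(X, Y, s):
    """Memory for inputs with s of n components present (rows of X unit length)."""
    n, m = X.shape                       # n components, m stored patterns
    G = (s - 1) * (X.T @ X) + (n - s) * np.eye(m)
    return (n - 1) * Y @ np.linalg.solve(G, X.T)

rng = np.random.default_rng(4)
n, m, p = 6, 3, 2
X = rng.standard_normal((n, m))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # equilibrate rows so R = I
Y = rng.standard_normal((p, m))

M_full = masked_input_memory(X, Y, s=n)   # nothing masked
M_one = masked_input_memory(X, Y, s=1)    # a single component present

# Two step recall for s = 4: regularized solve for t, then multiply by Y.
s = 4
alpha = (n - s) / (s - 1)
x_in = X[:, 0].copy()
x_in[s:] = 0.0                            # mask the last n - s components
t = np.linalg.solve(X.T @ X + alpha * np.eye(m), X.T @ x_in)
y_out = (n - 1) / (s - 1) * (Y @ t)
```

The two step form only ever inverts a matrix whose size is the number of
stored patterns, which is usually much smaller than the number of components.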
4. PROJECTION OPERATORS AND SUBSPACE METHODS OF PATTERN RECOGNITION


A  standard  classification  problem  is  that  of  identifying  an  unknown  input 
as  being  a  member  of  a  previously  learned  class.  This  identification  process 
is  typically  hampered  by  corrupting  noise  as  well  as  the  event  that  a  current 
unknown  input  may  be  either  totally  new  and  not  a  member  of  any  of  the 
previously  learned  classes,  or  may  be  a  mixed  signal  which  belongs  to  two  or 
more  of  the  previously  learned  classes. 

The basic concept of similarity among patterns is that the features of
similar patterns are similar. If these features are used to form a feature
space, each pattern can be represented by a vector in this space. Similar
patterns will have similar vectors, that is, vectors which are all in some
local region of the space and are close together in terms of some metric. In
subspace methods the feature space is divided into subspaces, one subspace for
each class of patterns. Ideally these subspaces would have no intersections.
The subspaces would be mutually exclusive, and pattern vectors of a class
would lie totally within the subspace of that class. The distance of an input
pattern from a class subspace is a measure of its similarity to the patterns
of the class.

Orthogonal projection methods,1-9 which use a different orthogonal
projector to represent the subspace spanned by each class, have been developed
and extensively studied. For potential mixed patterns, the lengths of the
projections of an unknown onto each of the class subspaces can be related to
the amounts of the individual classes that are present in the unknown. The
relation is not a simple one when the classes are not orthogonal to one
another. The use of oblique projectors rather than the more common orthogonal
projectors yields an attractive alternative. The major difficulty with
orthogonal projection is that unless different class subspaces are orthogonal,
there can be a large projection of an unknown of one class onto the subspace
of another class. This is a problem for both inputs which are mixtures and
pure inputs which are noisy. We show in Subsection 4.4 that for more than two
classes, noisy inputs can cause misclassification in orthogonal projections
while the same inputs are correctly classified by the corresponding set of
oblique projections. For mixtures there is another complication due to lack
of orthogonality of subspaces. A possible identifier for a mixture of two
classes would be the length of the projection of the unknown onto the subspace
spanned by the union of the two class subspaces. However, for nonorthogonal
subspaces the projection onto the union is not the sum of the projections onto
each class.

The  oblique  projection  techniques  are  robust  against  noise  and  will 
correctly  predict  the  input  until  the  noise  has  a  larger  component  in  another 
class  subspace.  Orthogonal  projection  can  fail  before  the  noise  is  this  large 
because  of  cross  talk  between  classes. 

The  major  advantage  of  oblique  projection  operators  is  that  the  feature 
space  can  be  divided  into  independent  mutually  exclusive  subspaces  by  a  set  of 
mutually  exclusive  oblique  projection  operators.  These  operators  will  give 
zero  projection  for  a  pattern  belonging  to  a  different  subspace  or  class. 

The use of oblique projectors in pattern recognition has been somewhat
limited to date [10], although the mathematical theory has been developed [11-14].

To  introduce  the  concepts  of  oblique  projections  the  elementary 
properties  of  idempotent  operators  are  reviewed.  The  relation  of  these 
projections to least squares techniques and regression is illustrated. A
constructive scheme for the projections based on least squares analysis is
presented in Subsection 4.3.

In Subsection 4.4 a geometric interpretation of oblique projections is
presented, together with a comparison between orthogonal and oblique projection
classification schemes. A general method of construction of oblique
projection operators is presented in Subsection 4.5. The subspace method can
be generalized to include bounded regions or zones by the introduction of
constrained projections. These add nonlinearity and provide a means of adding
additional information about a class when it is known. The implementation is
discussed in Subsection 4.6. The relation between generalized inverses and
projections is discussed in Subsection 4.7. The introduction of weighted
features into subspace methods is discussed in Subsection 4.8.

4.1  Elementary  Properties  of  Idempotent  Operators  and  Projectors 

The general properties of idempotent operators and the decomposition of
spaces into linearly independent subspaces are reviewed. Such operators play a
natural role in pattern recognition.

An operator $P$ is idempotent if $P^2 = P$. If $P$ is idempotent then its
complement $I - P$ is also idempotent:

$$(I - P)(I - P) = I - 2P + P^2 = I - 2P + P = I - P$$

The  major  feature  of  idempotency  and  the  use  of  idempotent  operators  in 
pattern  recognition  stems  from  the  following. 

An idempotent operator $P$ decomposes a space, $\Phi$, into two linearly
independent subspaces, $\phi_1$ and $\phi_2$, such that if $V$ belongs to $\phi_1$ then $PV = V$
and if $V$ belongs to $\phi_2$, $PV = 0$. Every vector $y$ in $\Phi$ can be decomposed into
$y = y_1 + y_2$ where

$$Py = y_1$$

and

$$(I - P)\,y = y_2$$

$y_1$ belongs to $\phi_1$ and $y_2$ belongs to $\phi_2$. It is useful to introduce a notation
change before generalizing to several subspaces. Let $P_1 = P$ and $P_2 = I - P$,
then $P_1 y = y_1$ and $P_2 y = y_2$.
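The decomposition can be checked numerically. The following is a minimal sketch using NumPy; the particular operator `P` and vector `y` are hypothetical values chosen for illustration and do not come from the report.

```python
import numpy as np

# Hypothetical example: a non-symmetric idempotent operator on R^2.
# It projects onto span{(1, 0)} along span{(1, 1)}.
P = np.array([[1.0, -1.0],
              [0.0,  0.0]])
I = np.eye(2)

# Idempotency of P and of its complement I - P.
assert np.allclose(P @ P, P)
assert np.allclose((I - P) @ (I - P), I - P)

# Any y splits into y1 = P y and y2 = (I - P) y with y = y1 + y2.
y = np.array([3.0, 2.0])
y1 = P @ y          # component in the first subspace
y2 = (I - P) @ y    # component in the second subspace
print(y1, y2)
```

Here `y1` lies along (1, 0) and `y2` along (1, 1), so the two subspaces are independent without being orthogonal.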


The  desired  properties  for  an  independent  subspace  method  for  pattern 
recognition  are  provided  by  a  set  of  mutually  exclusive  idempotent  operators. 

Given a set of $K$ idempotent operators $P_1, P_2, \ldots, P_K$ such that

$$P_I P_J = \delta_{IJ} P_J$$

where $\delta_{IJ}$ is the Kronecker delta, and

$$P_\Phi = \sum_{I=1}^{K} P_I$$

then $\Phi$ decomposes into $K$ linearly independent subspaces $\phi_1, \phi_2, \ldots, \phi_K$. $P_\Phi$
is the total projector. It projects a function into the space $\Phi$. If $\Phi$ is the
entire space, $P_\Phi = I$, the identity operator.

Each of the $K$ subspaces is defined as the set of all vectors (patterns)
belonging to the particular $\phi_I$. The subspaces are independent but are not
required or assumed to be orthogonal in order for the mutually exclusive
property to hold. If the pattern $V$ belongs to subspace $\phi_J$ (class $J$) then

$$P_J V = V$$

and

$$P_I V = 0 \quad \text{for } I \neq J.$$


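For such a classification scheme, patterns that are mixtures of two or
more classes can be identified. Let

$$V = V_L + V_M$$

with $V_L$ in $\phi_L$ and $V_M$ in $\phi_M$; then

$$P_L V = V_L, \qquad P_M V = V_M$$

and

$$P_I V = 0 \quad \text{for } I \neq L \text{ and } I \neq M.$$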

The key to the use of operator techniques is the determination of the
desired operators and finding representations for them. If the subspaces $\phi_K$
are orthogonal the idempotent operators will be symmetric. These operators
are called orthogonal projection operators. If the subspaces $\phi_K$ are not
orthogonal, the idempotent operators will not be symmetric. These idempotent
operators are called oblique projection operators.

Characteristics and construction methods for oblique and orthogonal
projection operators are further developed below.


4.2  Orthogonal  Subspaces  and  Orthogonal  Projection  Operators 


Two subspaces are orthogonal if every vector belonging to subspace $\phi_1$ is
orthogonal to every vector belonging to subspace $\phi_2$. Orthogonal subspaces are
special cases of linearly independent subspaces, and hence can be analyzed
using idempotent operators. The idempotent operators in turn will have
special characteristics. Let $R^n$ be divided into two orthogonal subspaces,
$\phi_1$ and $\phi_2$. There exist idempotent operators $P_1$ and $P_2 = I - P_1$ such that
every vector $V$ decomposes as

$$V = P_1 V + P_2 V, \qquad P_1 V \in \phi_1, \quad P_2 V \in \phi_2.$$

The special characteristic is that the idempotent operators associated with
orthogonal subspaces must be symmetric.

To show this we consider two vectors $U$ and $V$ and their decompositions

$$U = U_1 + U_2 = P_1 U + P_2 U$$

and

$$V = V_1 + V_2 = P_1 V + P_2 V$$

Due to the orthogonality of the subspaces

$$U_1^T V = U_1^T V_1 = U^T V_1$$

as

$$U_1^T V_2 = 0 \quad \text{and} \quad U_2^T V_1 = 0$$

The scalar product $U_1^T V$ can be written as

$$U_1^T V = (P_1 U)^T V = U^T P_1^T V$$

The scalar product $U^T V_1$ can be written as

$$U^T V_1 = U^T P_1 V$$

Since the two scalar products are equal for all $U$ and $V$,

$$P_1^T = P_1$$

that is, $P_1$ is symmetric. Operators which are both idempotent and symmetric
are called orthogonal projection operators. The orthogonal projection of a
vector onto a subspace has a simple geometric interpretation. For $P_1 V = V_1$,
$V - V_1$ is orthogonal to $V_1$ and to every other vector of $\phi_1$. The key property
is that the orthogonal projection, $V_1$, is that vector of $\phi_1$ which is closest in
distance to $V$. $(V - V_1)$ is the shortest vector between $V$ and the space $\phi_1$. For any
nonorthogonal (oblique) projection, $V_1'$, $(V - V_1')$ is longer than $(V - V_1)$.

4.3  Projection  Operators  and  Linear  Estimation 

Let $\phi$ be a subspace of $R^n$ and let $\{W_I\}$ be a basis for $\phi$. Then the
orthogonal projection of a function $f$ onto $\phi$ can be obtained through the least
squares procedure

$$f = WC + R$$

where the expansion coefficients are given by

$$C = (W^T W)^{-1} W^T f = A^{-1} W^T f$$

with $A = W^T W$ the metric of the basis. The desired projection operator is

$$P = W A^{-1} W^T$$

Thus we have

$$WC = Pf \quad \text{and} \quad R = (I - P) f$$
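As an illustration, the least squares projector can be formed and checked with NumPy. This is a sketch under assumed data: the basis `W` and the pattern `f` are random placeholders, not values from the report.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))      # hypothetical basis: two columns in R^5
f = rng.normal(size=5)           # hypothetical pattern to be projected

A = W.T @ W                      # metric (Gram matrix) of the basis
C = np.linalg.solve(A, W.T @ f)  # expansion coefficients C = A^{-1} W^T f
P = W @ np.linalg.solve(A, W.T)  # projector P = W A^{-1} W^T
R = f - W @ C                    # least squares residual

assert np.allclose(P @ P, P)                 # idempotent
assert np.allclose(P, P.T)                   # symmetric, hence orthogonal
assert np.allclose(P @ f, W @ C)             # P f is the fitted part W C
assert np.allclose(R, (np.eye(5) - P) @ f)   # R = (I - P) f
assert np.allclose(W.T @ R, 0.0)             # residual is orthogonal to the subspace
```

Solving against `A` rather than forming its explicit inverse is a numerical convenience of this sketch; the algebra is the same as in the text.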

The $K$ dimensional subspace $\phi$ may be broken down into $K$ one dimensional
subspaces, one associated with each $W_I$. If these subspaces are orthogonal and
the $W_I$ are normalized, then $A$ is the unit matrix and

$$P_\phi = \sum_{I=1}^{K} W_I W_I^T = \sum_{I=1}^{K} P_I$$

That is, if the one dimensional subspaces are orthogonal, $P_\phi$ can be written as
the sum of $K$ one dimensional orthogonal projection operators.

It is always possible to generate orthogonal subspaces by using standard
orthogonalization procedures, Gram-Schmidt for example. However, if the
nonorthogonal basis has a model interpretation associated with it, this
interpretation may be lost on orthogonalization.

When the subspaces are not orthogonal, the projections onto the subspaces
$\phi_I$ will not be orthogonal. The individual projections will be oblique. The
metric $A$ and its inverse $A^{-1}$ can be partitioned such that the projection
operator $P_\phi$ can be expressed as a sum of oblique projection operators

$$P_\phi = \sum_{I} O_I$$

The operator $O_I$ is the oblique projection operator for subspace $\phi_I$. It is
idempotent,

$$O_I^2 = O_I$$

but not symmetric,

$$O_I^T \neq O_I$$

The oblique projection operators have the following properties:

(a) $O_I W_J = \delta_{IJ} W_I$

(b) $(I - O_I) W_I = 0$

(c) $O_I O_J = \delta_{IJ} O_I$

(d) $P_\phi = \sum_{I} O_I$

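Property (a), $O_I W_J = \delta_{IJ} W_I$, is key to the use of these operators for
analysis of mixed unknowns, that is, unknowns with membership in more than one
class. If the unknown is

$$U = W_I C_I + W_J C_J$$

then $O_I U = W_I C_I$, $O_J U = W_J C_J$, and

$$O_K U = 0 \quad \text{for all } K \neq I \text{ or } J$$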
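The properties above can be illustrated with a small numerical sketch. It assumes one particular realization of the partition, namely that $O_I$ is the outer product of the basis vector $W_I$ with row $I$ of $A^{-1} W^T$; that explicit form, the basis `W`, and the mixture coefficients are assumptions made for illustration.

```python
import numpy as np

# Hypothetical nonorthogonal unit basis vectors in R^3, one per class (columns).
W = np.array([[1.0, 0.6, 0.0],
              [0.0, 0.8, 0.6],
              [0.0, 0.0, 0.8]])
A = W.T @ W                          # metric; off-diagonal terms are overlaps
D = np.linalg.solve(A, W.T)          # rows of A^{-1} W^T

# Assumed form of the partition: O_I = W_I (row I of A^{-1} W^T).
O = [np.outer(W[:, i], D[i]) for i in range(3)]

for i in range(3):
    assert np.allclose(O[i] @ O[i], O[i])                 # idempotent
    for j in range(3):
        expected = W[:, i] if i == j else np.zeros(3)
        assert np.allclose(O[i] @ W[:, j], expected)      # O_I W_J = delta_IJ W_I
assert not np.allclose(O[0], O[0].T)                       # not symmetric
assert np.allclose(sum(O), W @ D)                          # sum is P = W A^{-1} W^T

# A mixed unknown with membership in classes 0 and 1 only.
U = 2.0 * W[:, 0] + 0.5 * W[:, 1]
coeffs = D @ U                       # recovered coefficient for each class
print(np.round(coeffs, 6))
```

The recovered coefficients for the two member classes are the mixing amounts, and the projection onto the third class vanishes.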

It is useful to compare this result with the prediction of independent
orthogonal projections on a mixed unknown. Take the unknown to be
$U = W_1 C_1 + W_2 C_2$ with normalized $W_1$. The orthogonal projection
operator for subspace $\phi_1$ is

$$P_1 = W_1 W_1^T$$

The projection of the unknown $U$ onto $\phi_1$ is

$$P_1 U = W_1 (C_1 + A_{12} C_2)$$

since, in general,

$$P_I W_J = W_I A_{IJ}$$

where

$$A_{IJ} = W_I^T W_J$$

is the overlap between the subspaces. Hence

$$P_1 U = W_1 B_1$$

where

$$B_1 = C_1 + A_{12} C_2 \neq C_1$$

The effect of orthogonal projection is to include some of the $\phi_2$ component
in the projection. This is an undesirable property. The oblique projection
approach is more appropriate for nonorthogonal subspaces.


We can conclude that the best linear unbiased approximation of a function
$f$ by a basis set of functions $\{W_I\}$ is the orthogonal projection of $f$ onto
the subspace, $\phi$, spanned by the basis $\{W_I\}$. This orthogonal projector, $P$, is the
sum of individual projectors, one for each of the component subspaces of $\phi$.

The individual component projectors will be orthogonal if the basis is
orthogonal. The component projectors will be oblique if the basis is
nonorthogonal.


4.4  Geometric  Interpretation  of  Oblique  Projections 

Geometric interpretations of oblique and orthogonal projection operators
are presented. The relationship between coefficients obtained for orthogonal
and oblique projections is compared for simple $R^2$ and $R^3$ examples. We have
shown that the minimum distance classifiers defined in terms of oblique and
orthogonal projection methods agree only for the two class case in $R^2$. A
simple example in $R^3$ shows that orthogonal projections can fail to predict
correct classifications for nonorthogonal classes.


Oblique projections require two subspaces for their definition, the
subspace into which projection occurs and the subspace into which the adjoint
projection occurs. We label these subspaces $\Phi$ and $\Psi$, respectively.

$Of$ is the projection of $f$ onto $\Phi$ which is along $\Psi^\perp$. $O^T f$ is the
projection of $f$ onto $\Psi$ which is along $\Phi^\perp$. For a function $f$ and the oblique
projection operator $O$, $Of$ lies along $\Phi$. See Figure 4.1. The projection is
along $\Psi^\perp$, the space orthogonal to $\Psi$. $\Psi^\perp$ is the orthogonal complement of
$\Psi$. $\Psi^\perp$ is also the oblique complement of $\Phi$. The adjoint operator, $O^T$,
projects $f$ onto $\Psi$. This projection is along, or parallel to, $\Phi^\perp$, the
orthogonal complement of $\Phi$. $\Phi^\perp$ is also the oblique complement of $\Psi$. If $\Phi$
and $\Psi$ are equal, then the projection is orthogonal. One idempotent operator
generates two pairs of complementary projections, $(O, I - O)$ and $(O^T, I - O^T)$,
which project onto the complementary subspace pairs $(\Phi, \Psi^\perp)$ and $(\Psi, \Phi^\perp)$
respectively.

We concentrate now on a complementary pair and their relation to
orthogonal projections. Consider the projections $O_I$ and $O_J = I - O_I$ on
the independent but nonorthogonal subspaces $\phi_I$ and $\phi_J$, the oblique
complement of $\phi_I$. In Figure 4.2 an $R^2$ example is given. Both oblique
projections of a function $f$ onto $\phi_I$ and $\phi_J$ are illustrated as well as the
independent orthogonal projections $P_I f$ and $P_J f$. We illustrate that for the
two class example in $R^2$ if $C_I > C_J$ then $B_I > B_J$. This means that either the
orthogonal projector or the oblique projector will lead to the same prediction
if used as a minimum distance classifier for a two class problem. We will also
show that this does not generalize to a three class problem. If we let $W_I$ be a
basis for $\phi_I$ and $W_J$ be a basis for $\phi_J$ then

$$O_I f = W_I C_I, \qquad O_J f = W_J C_J$$

and

$$P_I f = W_I B_I, \qquad P_J f = W_J B_J$$

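The projection coefficients $B$ and $C$ are related through the metric

$$B = AC$$

For $A$ equal to

$$A = \begin{pmatrix} 1 & d \\ d & 1 \end{pmatrix}$$

where the $W$ vectors are assumed to be normalized and $d$ is the overlap
between classes, $0 < d < 1$, the difference $B_I - B_J = (C_I - C_J)(1 - d)$ is
proportional to $C_I - C_J$. If class selection is
based on the larger projection, then for $C_I > C_J$, $B_I > B_J$ and both
oblique and orthogonal projection predict class $I$. These results do not
extend to the three class problem. Consider the metric

$$A = \begin{pmatrix} 1 & d & e \\ d & 1 & f \\ e & f & 1 \end{pmatrix}$$

then

$$B_I = C_I + d\,C_J + e\,C_K$$
$$B_J = C_J + d\,C_I + f\,C_K$$

and

$$B_I - B_J = (C_I - C_J)(1 - d) + (e - f)\,C_K$$

The sign of $B_I - B_J$ can therefore differ from that of $C_I - C_J$ when the term
$(e - f)\,C_K$ is large enough.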
The coefficients $B_I = W_I^T f$ are unnormalized correlation coefficients.
The correlation between $W_I$ and the input $f$ is

$$r_I = \frac{B_I}{\|W_I\| \, \|f\|}$$

The magnitude of $B_I$ is the length of the orthogonal projection of $f$ into the
subspace $\phi_I$. The coefficients $C_I$ are related to signal strengths. If
both $W_I$ and $f$ are normalized to one, the square of the coefficient, $|C_I|^2$,
is the best unbiased estimate of the strength of $W_I$ in the function $f$. If
the functions $f$ are mixtures of subspaces, then the deconvolution of these
into subspace components requires the oblique projectors. The strength of
the subspace contributions is given by $|C_I|^2$.


The fact that the correlations $B_I$ do not necessarily follow the $C_I$ in
relative magnitude raises concerns about the use of orthogonal projection as a
classification technique when input unknowns are noisy. If a pure input from
subspace $\phi_I$ is corrupted by noise which has components in subspaces $\phi_J$ and
$\phi_K$, the relative magnitude of the correlation coefficients may not follow
the relative strengths of the signal and noise components.
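A three class case in which the two coefficient sets disagree can be sketched numerically. The overlaps `d`, `e`, `f_` and the strengths `C` below are hypothetical values picked so that the effect is visible, and the class vectors are built from a Cholesky factor purely as a convenient way to realize the assumed metric.

```python
import numpy as np

# Hypothetical overlaps between three unit-norm class vectors.
d, e, f_ = 0.2, 0.0, 0.8
A = np.array([[1.0, d,   e],
              [d,   1.0, f_],
              [e,   f_,  1.0]])

# Build class vectors W (columns) whose metric W^T W equals A.
W = np.linalg.cholesky(A).T

# Signal strengths: class I is strongest, class K acts as contamination.
C = np.array([1.0, 0.9, 0.5])
f = W @ C

B = W.T @ f                       # orthogonal projection coefficients, B = A C
C_rec = np.linalg.solve(A, B)     # oblique projection coefficients

print("B =", np.round(B, 3), "-> orthogonal choice:", int(np.argmax(B)))
print("C =", np.round(C_rec, 3), "-> oblique choice:", int(np.argmax(C_rec)))
```

With these numbers the largest correlation belongs to the second class even though the first class has the largest strength, because the contaminating class overlaps the second class much more than the first.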

4.5  General  Method  of  Construction  of  Oblique  Projectors 

A general method for generation of oblique projections is presented.
This approach is independent of a least squares model. An oblique projection
operator can be defined in terms of a pair of subspaces $\Phi$ and $\Psi$ of equal
dimension. Given $W$ as a basis for $\Phi$ and $Z$ as a basis for $\Psi$ the projector

$$O = W (Z^T W)^{I} Z^T$$

is defined, where $(Z^T W)^{I}$ is a generalized inverse of $(Z^T W)$. $O$ is
idempotent and therefore a projector:

$$O^2 = W (Z^T W)^{I} Z^T W (Z^T W)^{I} Z^T = W (Z^T W)^{I} Z^T = O$$

In applications the subspaces $\phi_A$ and $\phi_B$ of dimension $n_A$ and $n_B$
respectively will be known. The bases $W_A$ for $\phi_A$ and $W_B$ for $\phi_B$ can be
determined. To construct the desired projection operator $O_A$ for which

$$O_A W_A = W_A, \qquad O_A W_B = 0$$

a basis $Z_A$ is required. This basis will span the $n_A$ dimensional subspace
$\Psi_A$. The subspace $\Psi_A$ must have an orthogonal projection into $\phi_A$ and must
be orthogonal to $\phi_B$. The desired basis $Z_A$ can be constructed by taking
the basis $W_A$ and orthogonalizing it to the basis $W_B$,

$$Z_A = W_A - W_B Q_A$$

where

$$Q_A = (W_B^T W_B)^{-1} W_B^T W_A$$

is the desired $n_B \times n_A$ orthogonalization matrix. Similarly

$$Z_B = W_B - W_A Q_B$$

where

$$Q_B = (W_A^T W_A)^{-1} W_A^T W_B$$

After a bit of algebraic manipulation it can be shown that $O = O_A + O_B$ is the
orthogonal projector for the space $\Phi = \phi_A + \phi_B$. This generation method is
independent of initial basis selection for $\phi_A$ and $\phi_B$ and is independent of
the orthogonalization method for generation of $Z_A$ and $Z_B$. This is the
consequence of the invariance of $O$ to nonsingular transformation of the bases
$W$ and $Z$. Let $X$ and $Y$ be nonsingular. Then let
$\hat{W} = WX$ and $\hat{Z} = ZY$; then

$$\hat{O} = \hat{W} (\hat{Z}^T \hat{W})^{I} \hat{Z}^T$$
$$= WX\,(Y^T Z^T W X)^{I}\,Y^T Z^T$$
$$= W X X^{-1} (Z^T W)^{I} (Y^T)^{-1} Y^T Z^T = O$$

Therefore $\hat{O} = O$ and the projection is independent of the bases $W$ and $Z$.

4.6  Constrained  Projection  Operators 


The representation of a class as a subspace of its features does not
always contain all of the information that is known about the class. For
example, with spectral classes the absorbances are always positive, but the
subspaces spanned by a set of spectra include both positive and negative
absorbances. In order to avoid nonphysical spectral estimates as well as
incorrect classifications, it is useful to constrain the absorbances to be
positive. We develop below the projection method for two types of
constraints, inequality constraints and equality constraints. The constraint
methods are developed in the framework of constrained least squares. The
result is a set of operators for constrained oblique projection.
The least squares problem can be expressed as

$$\text{minimize } (b - Ax)^T (b - Ax)$$

with respect to $x$. The solution $A\hat{x}$ is given by

$$A\hat{x} = A (A^T A)^{-1} A^T b = Pb$$

where $P$ is the orthogonal projector. The inclusion of constraints can be
accomplished using Lagrangian techniques [15]. Constraints of the form $Gx = 0$,
equality constraints, and $Gx \geq 0$, inequality constraints, will be considered.

The more general constraint relations $Gx \geq h$ can be converted to the above by
the transformation $x = y + G^{I} h$ where $G^{I}$ is the pseudoinverse of $G$.

Equality  Constraints 

The least squares method with equality constraints can be expressed as

$$\text{minimize } (b - Ax)^T (b - Ax)$$

subject to $Gx = 0$.

The associated Lagrangian is

$$L(x, \lambda) = \tfrac{1}{2}\, x^T A^T A x - x^T A^T b - \lambda^T G x$$

A saddlepoint solution of $L(x, \lambda)$ is obtained if

$$A^T A x - A^T b - G^T \lambda = 0$$

and

$$Gx = 0$$

where $\lambda$ is the vector of Lagrange multipliers. For equality constraints the
multipliers are not restricted in sign.

The solution $\bar{x}$ of the constrained problem can be given in terms of the
solution $\hat{x}$ of the unconstrained problem as

$$\bar{x} = \hat{x} + (A^T A)^{-1} G^T \lambda$$

where $\lambda$ is given by

$$\lambda = -[G (A^T A)^{-1} G^T]^{-1} G \hat{x}$$

The solution $A\bar{x}$ can be expressed as

$$A\bar{x} = A Q (A^T A)^{-1} A^T b = P_Q b$$

where

$$Q = I - (A^T A)^{-1} G^T [G (A^T A)^{-1} G^T]^{-1} G$$

A bit of algebra will reveal that $Q$ is idempotent and hence a projector,
$Q^2 = Q$, and further $P_Q^2 = P_Q$.

$Q$ projects the unrestricted solution $\hat{x}$ onto the constraint subspace.

The residual includes the unrestricted residual as well as the residual

$$R_Q = A (I - Q)(A^T A)^{-1} A^T b$$

arising from projection onto the subspace that violates the constraints.
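The equality constrained solution can be verified numerically. The sketch below uses random placeholder data and a hypothetical single constraint (the first two components of $x$ are equal); none of these values come from the report.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 3))          # hypothetical model matrix
b = rng.normal(size=8)               # hypothetical measurements
G = np.array([[1.0, -1.0, 0.0]])     # hypothetical constraint: x0 = x1

N_inv = np.linalg.inv(A.T @ A)
x_hat = N_inv @ A.T @ b              # unconstrained least squares solution

# Lagrange multipliers and the projector Q onto the constraint subspace.
M = G @ N_inv @ G.T
lam = -np.linalg.solve(M, G @ x_hat)
Q = np.eye(3) - N_inv @ G.T @ np.linalg.solve(M, G)
x_bar = x_hat + N_inv @ G.T @ lam    # constrained solution

assert np.allclose(x_bar, Q @ x_hat)
assert np.allclose(G @ x_bar, 0.0)                   # constraint holds
assert np.allclose(Q @ Q, Q)                         # Q is idempotent
P_Q = A @ Q @ N_inv @ A.T
assert np.allclose(P_Q @ P_Q, P_Q)                   # so is P_Q
# Stationarity: A^T A x - A^T b - G^T lam = 0.
assert np.allclose(A.T @ A @ x_bar - A.T @ b - G.T @ lam, 0.0)
```

The explicit inverse of $A^T A$ is kept here only to mirror the formulas; a factorization would normally be preferred for larger problems.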


Inequality  Constraints 

For inequality constraints the optimization problem is

$$\text{minimize } (b - Ax)^T (b - Ax)$$

subject to $Gx \geq 0$.

Following the method outlined for equality constraints, the solutions for $x$ and
$\lambda$ are obtained from

$$(A^T A) x - A^T b - G^T \lambda = 0$$
$$Gx \geq 0$$
$$\lambda^T G x = 0$$

and

$$\lambda \geq 0$$

The major difference between equality and inequality constraints is that
the inequality constraints need not be active. If $\hat{x}$ satisfies the constraints
in that

$$G\hat{x} > 0$$

then

$$\bar{x} = \hat{x}, \qquad \lambda = 0$$

is the solution. In general, for every component $\bar{x}_i$ of $\bar{x}$ that is
greater than zero, the corresponding Lagrange multiplier $\lambda_i$ is zero. The
nonzero $\lambda_i$ correspond to active constraints $\bar{x}_i = 0$. These correspond to
the unrestricted solutions $\hat{x}_i$ violating the constraints, $\hat{x}_i < 0$. The
nonzero $\lambda_i$ are given as before for equality constraints and $\bar{x}$, $\lambda$ and $P_Q$
have the same properties. $Q$, however, depends on $b$. If for example

$$G\hat{x} = G (A^T A)^{-1} A^T b > 0$$

then

$$Q = I.$$

Without loss of generality we can assume that the first $k$ constraints are
inactive and the remaining constraints are active. Then $G$ can be partitioned
into $[G_1^T, G_2^T]^T$ where $G_2$ corresponds to the active constraints, and the nonzero
Lagrangian components are solved for from

$$\lambda = -[G_2 (A^T A)^{-1} G_2^T]^{-1} G_2 \hat{x}$$

The projection operator $P_Q$ for the constrained least squares problem can be
considered as a sum of oblique projection operators, one for each class as in
the case of unconstrained least squares.
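A small positivity constrained example may make the role of the active constraints concrete. The matrix `A`, the data `b`, and the choice $G = I$ are hypothetical. The sketch takes the constraints violated by the unrestricted solution as the active set in a single pass, which happens to be correct for this example; a general problem would need an iterative active-set procedure.

```python
import numpy as np

# Hypothetical two-class model with overlapping unit columns, constraints x >= 0.
A = np.array([[1.0, 0.8],
              [0.0, 0.6],
              [0.0, 0.0]])
b = np.array([1.0, -0.3, 0.2])
G = np.eye(2)

N_inv = np.linalg.inv(A.T @ A)
x_hat = N_inv @ A.T @ b               # unrestricted solution

# Treat the constraints violated by x_hat as active (a single active-set guess).
active = G @ x_hat < 0
G2 = G[active]

lam = np.zeros(len(G))
if active.any():
    M = G2 @ N_inv @ G2.T
    lam[active] = -np.linalg.solve(M, G2 @ x_hat)
x_bar = x_hat + N_inv @ G.T @ lam     # constrained solution

print("x_hat =", np.round(x_hat, 6))
print("x_bar =", np.round(x_bar, 6), "lambda =", np.round(lam, 6))
```

The negative component of the unrestricted solution is moved to zero, its multiplier is positive, and the multiplier of the inactive constraint stays at zero.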

4.7  Relation  Between  Generalized  Inverses  and  Projection  Methods 

A  relationship  between  certain  generalized  inverses  and  constrained 
oblique  projection  operators  was  identified.  The  variety  of  generalized 
inverses  suggests  that  a  large  variety  of  potentially  useful  projection 
operators  can  be  generated. 

Oblique projectors and constraints can be considered in terms of
generalized inverses of matrices. A projection operator can be written as
$P_W = W W^{I}$ where $W^{I}$ is a generalized inverse of $W$. The selection of
the Moore Penrose inverse

$$W^{I} = (W^T W)^{-1} W^T = A^{-1} W^T$$

leads to the orthogonal projector $P_W = W A^{-1} W^T$. The Moore Penrose inverse
is identical to the true inverse for nonsingular matrices. The existence of
$A^{-1}$ is based on the linear independence of the columns of $W$. The Moore
Penrose inverse $X$ of a matrix $A$ (here $A$ denotes a general matrix rather than
the metric) satisfies the following four properties:

$$AXA = A \qquad (1)$$
$$XAX = X \qquad (2)$$
$$(AX)^T = AX \qquad (3)$$
$$(XA)^T = XA \qquad (4)$$

An example of a generalized inverse that does not satisfy all four conditions
is

$$A^{I} = (Z^T A)^{I} Z^T$$

This inverse satisfies conditions 1, 2 and 4 but not 3, as

$$O = A (Z^T A)^{I} Z^T \neq O^T = Z (A^T Z)^{I} A^T$$

Condition (3) is equivalent to requiring $A A^{I}$ to be an orthogonal
projection. The oblique projections themselves contain a generalized
inverse. The inverse $A^{I} = (Z^T A)^{I} Z^T$ is called a 1, 2, 4 inverse [14].
In this nomenclature the Moore Penrose inverse is a 1, 2, 3, 4 inverse.
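The four conditions can be tested directly on a small case. The single-column bases `A` and `B` are hypothetical, and the true inverse of $Z^T A$ is used in place of the generalized inverse on the assumption that it exists.

```python
import numpy as np

# Hypothetical one-dimensional classes in R^2 (nonorthogonal unit vectors).
A = np.array([[1.0], [0.0]])
B = np.array([[0.6], [0.8]])

# Z is A orthogonalized against B, as in the construction of oblique projectors.
Z = A - B @ np.linalg.solve(B.T @ B, B.T @ A)
X = np.linalg.solve(Z.T @ A, Z.T)      # candidate inverse (Z^T A)^{-1} Z^T

def is_symmetric(M):
    return bool(np.allclose(M, M.T))

conditions = {
    1: bool(np.allclose(A @ X @ A, A)),
    2: bool(np.allclose(X @ A @ X, X)),
    3: is_symmetric(A @ X),
    4: is_symmetric(X @ A),
}
print(conditions)

# For comparison, the Moore Penrose inverse passes the two symmetry conditions.
X_mp = np.linalg.pinv(A)
assert is_symmetric(A @ X_mp) and is_symmetric(X_mp @ A)
```

Only condition 3 fails for this inverse, because `A @ X` is the oblique projector and is not symmetric.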


The construction of $Z$ as

$$Z = A - B (B^T B)^{-1} (B^T A)$$

ensures that $Z^T A$ will be singular only if the subspaces spanned by $A$ and $B$
are not independent. If the subspaces are dependent then one or more of the
columns of $Z$ will be of zero length. For nonsingular $Z^T A$, the generalized
inverse $(Z^T A)^{I}$ can be replaced with the true inverse in $A^{I}$ and $O$
respectively. The Moore Penrose inverse is identical to the true inverse in
this case. If we let $(Z^T A)^{I} = (Y^T Z^T A)^{I} Y^T$ and substitute this
inverse into $O$ we have

$$O = A (Z^T A)^{I} Z^T = A (Y^T Z^T A)^{I} Y^T Z^T$$

Using the identity

$$(Z^T A)(Z^T A)^{-1} = I$$

yields

$$O = A (Y^T Z^T A)^{I} (Y^T Z^T A)(Z^T A)^{-1} Z^T = A Q (Z^T A)^{-1} Z^T$$

where

$$Q = (Y^T Z^T A)^{I} (Y^T Z^T A)$$

is a projection operator.

This result is a special case of the following. A generalized inverse of
a nonsingular matrix $X$ is equal to the product of an idempotent operator and
the true inverse

$$X^{I} = (X^{I} X)\, X^{-1} = X^{-1} (X X^{I})$$

The left and right idempotent operators are the projectors $(X^{I} X)$ and $(X X^{I})$
respectively. If the Moore Penrose inverse is used the idempotent operators
are the identity operators.


The projection operators developed for constrained least squares
applications are also expressible in terms of 1, 2, 4 generalized inverses.
The problem of minimizing $\|Ax - b\|$ subject to $Gx = 0$ had the solution $P_Q b$
where

$$P_Q = A Q (A^T A)^{-1} A^T$$

and

$$Q = I - (A^T A)^{-1} G^T [G (A^T A)^{-1} G^T]^{-1} G$$

Substituting $H = (A^T A)^{-1} G^T$ and $S^T = G$ yields $Q = I - H (S^T H)^{-1} S^T$
with the 1, 2, 4 generalized inverse $(S^T H)^{-1} S^T$.

4.8  Weighted  Features  and  Subspace  Methods 

The  incorporation  of  feature  weighting  into  oblique  projectors  was 
considered.  Justification  for  its  use  is  provided  below  from  a  standpoint  of 
least  squares  theory.  The  weighting  of  features  is  based  on  the  statistical 
description  of  the  measurement  errors.  The  assumptions  are  that  the  average 
values  of  errors  are  zero  and  that  the  variances  and  covariances  are  known. 

If the model $R = b - Ax$ is adequate, the errors, $R_i$, associated with each row
or measurement will be unbiased. By this is meant that with an ensemble of
repeated measurements of $b_i$ the set of $R_i$ will have zero mean [16]. Using
this ensemble the covariance matrix of the errors, $\Sigma_R$, can be formed. For
obvious reasons the covariance matrix is often not known. In any event the
estimate of the errors in the solution vector $x$ is related to the covariance
matrix, $\Sigma_R$. The ensemble of solution vectors $\{X_i\}$ has a covariance matrix

$$\Sigma_X = A^{I} \Sigma_R (A^{I})^T \qquad (1)$$

where $A^{I}$ is the pseudoinverse of $A$. For unweighted least squares the
inverse is the Moore Penrose inverse, $A^{I} = (A^T A)^{-1} A^T$. For weighted
least squares the pseudoinverse can be written $A^{I} = (A^T W A)^{-1} A^T W$, where $W$
is the weight matrix. The weight matrix must be positive definite. The
covariance matrix for the solution vector, $\Sigma_X$, is then given by

$$\Sigma_X = (A^T W A)^{-1} (A^T W \Sigma_R W A)(A^T W A)^{-1} \qquad (2)$$

It is the diagonal elements of $\Sigma_X$, $\sigma_{jj}^2$, that give the errors associated
with $X_j$.

Gauss' Theorem states an important criterion for selection of weights.
The theorem states that the weight which will give minimum variance is the
inverse of the covariance of the measurement errors. That is, if $W = \Sigma_R^{-1}$
then the $\sigma_{jj}^2$ will be a minimum. This intuitively makes sense as
measurements with large variances will have small weights and measurements
with small variances will be weighted heavily. The minimum variance
covariance matrix for $X$ is given by

$$\Sigma_X = (A^T \Sigma_R^{-1} A)^{-1} \qquad (3)$$

$\Sigma_R$ is usually not known and the assumption of it being a constant
matrix $\sigma^2 I$ is ubiquitous since it shows no prejudice against any
measurements. The unweighted least squares procedure will give an unbiased
estimate of $X$ and, if the error covariance is a constant matrix, the unweighted
least squares will give a minimum variance estimate of $X$. The penalty for
using the wrong weight ($W \neq \Sigma_R^{-1}$) is the loss of minimum variance. A
critical question concerning minimum variance is how sensitive the variance is
to the wrong weight. (Note: no weight at all, $W = I$, is the wrong weight if
$\Sigma_R$ is not a constant matrix.)

The selection of measurements or weighting of
measurements is a standard approach in pattern recognition. The weighting or
selection is based more on usefulness in distinguishing classes than on
concerns about measurement error. The general rather than accidental success
of such procedures would require that this type of measurement weighting does
not have a large effect on the covariance. A simple example lends support to
this idea. Consider a two class problem and a simple minimum distance
classifier based on $b = Ax$ where $A$ is an $m \times 2$ matrix whose columns $a_1$ and $a_2$ are the
average vectors for class 1 and class 2 respectively. The $m$ features are
measured with equal precision and the features are uncorrelated. The
covariance matrix of the errors can be expressed as $\Sigma_R = \sigma^2 I$.


Even though the $m$ measurements are of equal precision, only the first will be
useful in distinguishing between the two classes. For the minimum distance
classifier $X_1 > X_2$ if $b$ belongs to class 1.

For this example

$$A^T A = \begin{pmatrix} m & m-2 \\ m-2 & m \end{pmatrix}$$

the condition number $\mathrm{Cond}(A^T A) = m - 1$ and the covariance can be calculated
from (4) as

$$\Sigma_X = \sigma^2 (A^T A)^{-1} = \frac{\sigma^2}{4(m-1)} \begin{pmatrix} m & 2-m \\ 2-m & m \end{pmatrix} \qquad (7)$$

The uncertainties in $X_1$ and $X_2$ are obtained from the diagonal elements

$$\sigma_{11}^2 = \sigma_{22}^2 = \frac{\sigma^2 m}{4(m-1)}$$

and are independent of $m$ for large $m$. The inclusion of a large number of
precisely measured but unuseful measurements does not reduce the uncertainties but does
contribute to an increase in the condition number and to the correlation
between class 1 and class 2.

A  weighting  scheme  is  used  which  attempts  to  minimize  the  apparent 
correlation  between  classes  by  minimizing  the  condition  number.  The  following 
diagonal  weight  matrix  will  cause  the  apparent  correlation  to  be  zero. 


$W$ is an $m \times m$ diagonal matrix. The model now uses a weighted least squares
calculation. The matrix $(A^T W A)$ is diagonal and class 1 appears uncorrelated
with class 2. The condition number $\mathrm{Cond}(A^T W A) = 1$, an absolute minimum. We
have

$$A^T W A = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

and

$$X = (A^T W A)^{-1} A^T W b$$

Since these weights are not the inverse of the covariance of the measurement
errors, the covariance of $X$ must be calculated from Eq. (2) rather than Eq.
(3) and in principle will not yield the minimum variances. However,
substitution of the required quantities in Eq. (2) yields the same estimate
for $\Sigma_X$ as was obtained in Eq. (7). The use of the incorrect weights in this
model has not had an adverse effect on the solution variances. The weighting
generates a numerically stable model with a well conditioned normal matrix
which is easily inverted. This example offers some hope for the use of
weighting schemes which minimize apparent correlation or minimize the
importance of measurements which are not useful for classification.
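The example can be reproduced numerically under explicit assumptions. The class vectors and the weights used below are not given in the surviving text; they are one choice that is consistent with the quoted condition numbers and with Eq. (7), namely class means that differ only in the sign of the first feature, and weights of 1 on the first feature and $1/(m-1)$ on the rest.

```python
import numpy as np

m, sigma2 = 10, 1.0

# Assumed class mean vectors: they differ only in the first feature.
a1 = np.ones(m)
a2 = np.ones(m)
a2[0] = -1.0
A = np.column_stack([a1, a2])
Sigma_R = sigma2 * np.eye(m)               # equal precision, uncorrelated errors

def solution_covariance(A, W, Sigma_R):
    """Eq. (2): covariance of the weighted least squares solution."""
    N_inv = np.linalg.inv(A.T @ W @ A)
    return N_inv @ (A.T @ W @ Sigma_R @ W @ A) @ N_inv

# Unweighted case, W = I.
cov_unweighted = solution_covariance(A, np.eye(m), Sigma_R)

# Assumed diagonal weights that make A^T W A diagonal.
w = np.full(m, 1.0 / (m - 1))
w[0] = 1.0
W = np.diag(w)
cov_weighted = solution_covariance(A, W, Sigma_R)

print(np.linalg.cond(A.T @ A), np.linalg.cond(A.T @ W @ A))
print(np.round(cov_unweighted, 4))
print(np.round(cov_weighted, 4))
```

Under these assumptions the weighted normal matrix is twice the identity, and the two covariance matrices agree, which matches the statement that the incorrect weights do not degrade the solution variances in this model.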

The use of weights in oblique projection techniques is attractive. The
weighted projection method can be formulated so that once the projector is
generated during a learning step, no additional burden will be incurred. The
classification step is as simple as in the unweighted case. The form of the
oblique projection operator used here is $O_A = A (Z^T A)^{I} Z^T$ where $Z$ is a
basis for the projection onto $A$ which is orthogonal to $\phi_B$. That is, $Z^T \hat{B} = 0$
for any vector $\hat{B}$ that belongs to $\phi_B$. When using weighted measurements a $Z$
is sought which is $W$ orthogonal to $\phi_B$. In the unweighted case $Z$ is
constructed from

$$Z = A - B (B^T B)^{-1} B^T A$$

For the weighted case the desired $Z$ is

$$Z = W A - W B (B^T W B)^{-1} (B^T W A)$$

Then

$$Z^T B = A^T W B - (A^T W B)(B^T W B)^{-1} (B^T W B) = 0$$

and the weighted projection is given by

$$O_A = A (Z^T A)^{I} Z^T$$

$O_A$ works directly on unweighted vectors. Thus no additional burden over
unweighted projections occurs once $Z$ and $O_A$ are formed.
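A short sketch of the weighted construction follows. The bases `A` and `B` and the positive diagonal weight matrix `W` are random placeholders, and the true inverse of $Z^T A$ is used on the assumption that the two class subspaces are independent.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 2))                 # hypothetical basis for class A
B = rng.normal(size=(6, 2))                 # hypothetical basis for class B
W = np.diag(rng.uniform(0.5, 2.0, size=6))  # hypothetical positive weights

# Weighted construction: Z = W A - W B (B^T W B)^{-1} (B^T W A).
Z = W @ A - W @ B @ np.linalg.solve(B.T @ W @ B, B.T @ W @ A)
O_A = A @ np.linalg.solve(Z.T @ A, Z.T)     # weighted oblique projector

assert np.allclose(Z.T @ B, 0.0)            # Z is orthogonal to class B
assert np.allclose(O_A @ A, A)              # class A patterns are kept
assert np.allclose(O_A @ B, 0.0)            # class B patterns are rejected

# Classification applies O_A directly to an unweighted pattern.
c_A = np.array([1.5, -0.5])
c_B = np.array([0.3, 2.0])
pattern = A @ c_A + B @ c_B
assert np.allclose(O_A @ pattern, A @ c_A)
```

The weights enter only while `Z` is formed; applying `O_A` to a new pattern is a single matrix product, as in the unweighted case.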